Drought Data Analysis

Table of Contents

  • Introduction
  • Data Wrangling
  • Exploratory Data Analysis
  • Conclusions

Introduction

In this project, we conduct a descriptive analysis of drought disasters in Africa to understand the magnitude and impact of drought across several dimensions.

About the Dataset:

The EM-DAT database records mass disasters as well as their health and economic impacts at a country level. The database contains core data on the occurrence and effects of 26,000 disasters worldwide from 1900 to the present, and is managed and distributed by the Centre for Research on the Epidemiology of Disasters (CRED). The database is compiled from various sources of information, including UN agencies, non-governmental organizations, insurance companies, research institutes, and press agencies. The dataset used for this project was filtered specifically for drought disasters in Africa, and consists of 168 observations of drought disasters between 1999 and 2022. Detailed documentation of the data and a glossary of each feature can be found on the EM-DAT website.

Research Questions:

This descriptive analysis aims to answer the following questions about drought as a natural disaster in Africa:

1. Who were the worst hit?

What is their distribution per:

  • country
  • region
  • income group

2. What are the effects of drought?

  • direct
  • secondary

Data Wrangling

Note: The original data has been cleaned in a separate notebook which can be found here

We use this cell to import the libraries and packages needed to analyse the data:

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
import plotly.express as px
%matplotlib inline
In [2]:
# If we do not currently have a library or package installed on our current device, we can quickly install them using pip as shown below:

# UNCOMMENT NEXT LINE TO INSTALL PLOTLY LIBRARY. This requires an internet connection. 
# You don't have to run it next time once you have the library installed. Please kindly wait for the install to complete.

# !pip install plotly

Now, let's load the data that we cleaned earlier and saved as "drought_data_cleaned.csv". We also want to ensure that each column is loaded with the right data type.

In [4]:
df = pd.read_csv('drought_data_cleaned.csv',
                dtype = {
                    "Total Affected": 'int64',
                    "Country": 'category',
                    "Subregion": 'category',
                    "Associated Types": 'category',
                    "OFDA Response": 'category',
                    "Appeal": 'category',
                    "Declaration": 'category',
                    "Start Year": 'category',
                    "Start Month": 'category',
                    "End Year": 'category',
                    "End Month": 'category',
                    "UN Sub Region": 'category',
                    "Income group": 'category',                    
                })
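
To see what the `dtype` argument does, here is a minimal sketch that reads a three-row stand-in for the CSV from memory. The sample rows are copied from the first lines of the table shown in this notebook; the names `sample_csv` and `sample_df` are illustrative only and are not part of the analysis.

```python
import io

import pandas as pd

# A tiny in-memory stand-in for drought_data_cleaned.csv (illustrative rows only)
sample_csv = io.StringIO(
    "Total Affected,Country,Income group\n"
    "100000,Djibouti,Middle Income\n"
    "2000000,Sudan,Middle Income\n"
    "1200000,Somalia,Low Income\n"
)

sample_df = pd.read_csv(
    sample_csv,
    dtype={
        "Total Affected": "int64",
        "Country": "category",
        "Income group": "category",
    },
)

# Each column now carries the type that was requested
print(sample_df.dtypes)

# A categorical column stores each distinct label only once
print(list(sample_df["Income group"].cat.categories))
```

Loading repeated labels such as country or income group as `category` keeps memory use low and makes later group-by operations behave consistently.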

Let's print out a few lines of the table to see if it was correctly loaded. In this case, we shall print out only the first five lines.

In [5]:
df.head(n=5)
Out[5]:
Total Affected Country Subregion ISO Origin Associated Types OFDA Response Appeal Declaration Start Year Start Month End Year End Month UN Sub Region Income group
0 100000 Djibouti Sub-Saharan Africa DJI Not specified Not specified Yes No No 2001 6.0 2001 0.0 Eastern Africa Middle Income
1 2000000 Sudan Northern Africa SDN Not specified Food shortage|Water shortage No No No 2000 1.0 2001 0.0 Northern Africa Middle Income
2 1200000 Somalia Sub-Saharan Africa SOM Not specified Food shortage No No No 2000 1.0 2001 0.0 Eastern Africa Low Income
3 231290 Madagascar Sub-Saharan Africa MDG Not specified Not specified No No No 2000 6.0 2000 0.0 Eastern Africa Low Income
4 0 Burkina Faso Sub-Saharan Africa BFA Not specified Not specified No No No 2001 4.0 2001 0.0 Western Africa Low Income

The code cell below will provide additional information about the data, such as the number of columns, how many entries are missing in each column, and the data type of the column.

In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 168 entries, 0 to 167
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   Total Affected    168 non-null    int64   
 1   Country           168 non-null    category
 2   Subregion         168 non-null    category
 3   ISO               168 non-null    object  
 4   Origin            168 non-null    object  
 5   Associated Types  168 non-null    category
 6   OFDA Response     168 non-null    category
 7   Appeal            168 non-null    category
 8   Declaration       168 non-null    category
 9   Start Year        168 non-null    category
 10  Start Month       168 non-null    category
 11  End Year          168 non-null    category
 12  End Month         168 non-null    category
 13  UN Sub Region     168 non-null    category
 14  Income group      168 non-null    category
dtypes: category(12), int64(1), object(2)
memory usage: 11.1+ KB

Exploratory Data Analysis

Now that we've trimmed and cleaned our data, we're ready to move on to exploration. It's now time to compute statistics and create visualizations with the goal of addressing the research questions that we posed in the Introduction section. We shall be systematic with our approach, by looking at one variable at a time, and then following it up by looking at relationships between variables.

Research Question 1: Who were the worst hit?

Distribution per region

The code below looks at the "UN Sub Region" column of our data, to tell us how many unique entries are represented. In this case, there are 5 Sub Regions represented, following the UN regional classification.

In [6]:
df['UN Sub Region'].unique()
Out[6]:
['Eastern Africa', 'Northern Africa', 'Western Africa', 'Middle Africa', 'Southern Africa']
Categories (5, object): ['Eastern Africa', 'Middle Africa', 'Northern Africa', 'Southern Africa', 'Western Africa']

Now, let's find out how many persons were affected in each subregion during the period under review. The first line of code is an assignment: to the left of the "=" sign is the variable, a container that stores the result of our operation, and to the right is the Python expression that performs the operation. The variable names have been made descriptive enough to give an idea of the content each one stores.

In [7]:
# Select the numeric column explicitly so that only "Total Affected" is summed
count_by_subregion = df.groupby("UN Sub Region")["Total Affected"].sum().reset_index()
count_by_subregion
Out[7]:
UN Sub Region Total Affected
0 Eastern Africa 227248956
1 Middle Africa 44340003
2 Northern Africa 17839300
3 Southern Africa 38461515
4 Western Africa 86394334
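
The grouping step can be tried on a handful of rows. The sketch below uses four rows copied from the first lines of the table shown earlier; the name `toy` is hypothetical. It selects the numeric column before summing, which is one way to make sure text columns are never passed to `sum()`.

```python
import pandas as pd

# Illustrative rows taken from the first lines of the table shown earlier
toy = pd.DataFrame(
    {
        "UN Sub Region": ["Eastern Africa", "Northern Africa", "Eastern Africa", "Western Africa"],
        "Country": ["Djibouti", "Sudan", "Somalia", "Burkina Faso"],
        "Total Affected": [100000, 2000000, 1200000, 0],
    }
)

# Pick the numeric column first, then sum it within each subregion
toy_by_subregion = (
    toy.groupby("UN Sub Region")["Total Affected"]
    .sum()
    .reset_index()
)
print(toy_by_subregion)
```

Here the two Eastern Africa rows are added together, while the other subregions keep their single value.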

The cell above shows that Eastern Africa was the worst hit in terms of the number of persons affected by drought. Let's see how it compares with the other regions. We shall first compute the percentage for each subregion, then plot a simple donut chart.

In [3]:
# compute percentages for each subregion
count_by_subregion['Percentage'] = (count_by_subregion['Total Affected'] / sum(count_by_subregion['Total Affected'])) * 100
count_by_subregion
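
As a check on the arithmetic, the same percentages can be worked out with plain Python from the totals printed in the grouped table above. This is only a sketch of the calculation; the dictionary `totals` is typed in by hand from that output.

```python
# Totals copied from the grouped table above
totals = {
    "Eastern Africa": 227248956,
    "Middle Africa": 44340003,
    "Northern Africa": 17839300,
    "Southern Africa": 38461515,
    "Western Africa": 86394334,
}

grand_total = sum(totals.values())

# Share of each subregion as a percentage of all persons affected
shares = {region: affected / grand_total * 100 for region, affected in totals.items()}

for region, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{region}: {share:.1f}%")
```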

The code cell below plots a donut chart for us using the plotly library:

In [9]:
template = "ggplot2"
count_by_subregion_donut = px.pie(data_frame=count_by_subregion, 
       values='Total Affected', 
       width = 600,
       height = 600,
       names = "UN Sub Region", 
       title="Number of persons affected per UN Subregion",
       hole= 0.7,
       template = template,
)
In [ ]:
count_by_subregion_donut.update_layout(
    legend=dict(x=1, y=0))

OBSERVATION: The visual above shows that 54.9% of the persons affected are from Eastern Africa. This means that more than one in every two persons affected by drought in Africa is from Eastern Africa.

Distribution by Country

Now, to be more specific, let's see which countries were hit hardest:

In [14]:
# Sum only the "Total Affected" column, then rank countries from most to least affected
count_by_country = (
    df.groupby("Country")["Total Affected"].sum()
    .sort_values(ascending=False)
    .reset_index()
)
count_by_country["Percentage of Total Affected"] = count_by_country["Total Affected"] / count_by_country["Total Affected"].sum() * 100
count_by_country
Out[14]:
Country Total Affected Percentage of Total Affected
0 Ethiopia 74705679 18.032475
1 South Africa 30450000 7.350028
2 Kenya 29750000 7.181062
3 Niger 29303986 7.073403
4 Somalia 27535624 6.646556
5 Democratic Republic of the Congo 25972806 6.269322
6 Zimbabwe 21135118 5.101600
7 Malawi 19727628 4.761860
8 Nigeria 19110398 4.612873
9 Sudan 17839300 4.306055
10 South Sudan 15623670 3.771245
11 Mali 13660753 3.297436
12 Burkina Faso 13250928 3.198512
13 Mozambique 10262271 2.477110
14 United Republic of Tanzania 8954000 2.161319
15 Chad 8822162 2.129496
16 Mauritania 8205374 1.980615
17 Madagascar 5665290 1.367489
18 Angola 4922216 1.188126
19 Zambia 4210000 1.016211
20 Lesotho 3608515 0.871024
21 Uganda 3542000 0.854969
22 Burundi 2412500 0.582330
23 Cameroon 2401127 0.579585
24 Namibia 2261000 0.545761
25 Central African Republic 2221692 0.536273
26 Eswatini 2104000 0.507864
27 Senegal 2093702 0.505378
28 Eritrea 1700000 0.410346
29 Djibouti 1025176 0.247457
30 Rwanda 1000000 0.241380
31 Gambia 491100 0.118542
32 Cabo Verde 146093 0.035264
33 Guinea-Bissau 132000 0.031862
34 Botswana 38000 0.009172

Let's visualize this as a bar chart:

In [15]:
count_by_country_top_15 = count_by_country.nlargest(columns='Percentage of Total Affected', n=15)
fig_count_by_country_top_15 = px.bar(data_frame = count_by_country_top_15,
       x='Percentage of Total Affected', 
       y='Country',
       text= count_by_country_top_15["Percentage of Total Affected"],
                                     template = 'ggplot2',
 )
In [ ]:
fig_count_by_country_top_15.show()

OBSERVATION: From the above analysis, the following countries were the most hit in terms of the number of persons affected by drought:

  • Ethiopia
  • South Africa
  • Kenya
  • Niger
  • Somalia

Interestingly, three of the five most affected countries (Ethiopia, Kenya and Somalia) are Eastern African countries. Why exactly is Eastern Africa the most affected by drought? Could it be a geographical factor, or is it a result of poor emergency response on the part of governments?

This question calls for further investigations as our data cannot provide an answer to it!
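
The count of Eastern African countries among the top five can be verified with a few lines of plain Python. The sub region for each country is the one recorded in this dataset; the dictionary `top_five` is typed in by hand for illustration.

```python
# Five most affected countries, in rank order, with the UN sub region recorded in the data
top_five = {
    "Ethiopia": "Eastern Africa",
    "South Africa": "Southern Africa",
    "Kenya": "Eastern Africa",
    "Niger": "Western Africa",
    "Somalia": "Eastern Africa",
}

eastern = [country for country, region in top_five.items() if region == "Eastern Africa"]
print(f"{len(eastern)} of {len(top_five)} are in Eastern Africa: {eastern}")
```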

Now, let's visualize this as a map and see how the impact of drought is distributed across the continent. We shall start with a basic, empty map.

The map below colors only the African countries represented in our data. Countries shown in grey are those for which we do not have data.

In [21]:
basic_map = px.choropleth(data_frame=df, locations="ISO", 
                          locationmode="ISO-3", 
                          scope='africa', 
                          color='UN Sub Region',
            )
basic_map.show()

Now that we know how to plot a map, let's feed the map with actual data to see where each country belongs in terms of the number of persons affected by drought.

In [24]:
# Table of countries and their corresponding total of affected persons
map_df = df.groupby('Country')["Total Affected"].sum().reset_index()
map_df
Out[24]:
Country Total Affected
0 Angola 4922216
1 Botswana 38000
2 Burkina Faso 13250928
3 Burundi 2412500
4 Cabo Verde 146093
5 Cameroon 2401127
6 Central African Republic 2221692
7 Chad 8822162
8 Democratic Republic of the Congo 25972806
9 Djibouti 1025176
10 Eritrea 1700000
11 Eswatini 2104000
12 Ethiopia 74705679
13 Gambia 491100
14 Guinea-Bissau 132000
15 Kenya 29750000
16 Lesotho 3608515
17 Madagascar 5665290
18 Malawi 19727628
19 Mali 13660753
20 Mauritania 8205374
21 Mozambique 10262271
22 Namibia 2261000
23 Niger 29303986
24 Nigeria 19110398
25 Rwanda 1000000
26 Senegal 2093702
27 Somalia 27535624
28 South Africa 30450000
29 South Sudan 15623670
30 Sudan 17839300
31 Uganda 3542000
32 United Republic of Tanzania 8954000
33 Zambia 4210000
34 Zimbabwe 21135118

Next, let's create a temporary dataframe containing a few columns of our main dataframe, with one row per country, so that we can join them to the aggregated dataframe for the map.

In [25]:
temp_df = df[['Country', 'UN Sub Region', 'Income group', 'ISO']]
temp_df = temp_df.drop_duplicates()
temp_df
Out[25]:
Country UN Sub Region Income group ISO
0 Djibouti Eastern Africa Middle Income DJI
1 Sudan Northern Africa Middle Income SDN
2 Somalia Eastern Africa Low Income SOM
3 Madagascar Eastern Africa Low Income MDG
4 Burkina Faso Western Africa Low Income BFA
5 Mali Western Africa Low Income MLI
6 Niger Western Africa Low Income NER
7 Chad Middle Africa Low Income TCD
8 Mozambique Eastern Africa Low Income MOZ
9 Cameroon Middle Africa Middle Income CMR
11 Eswatini Southern Africa Middle Income SWZ
12 Zimbabwe Eastern Africa Low Income ZWE
13 Angola Middle Africa Middle Income AGO
14 Mauritania Western Africa Middle Income MRT
15 Namibia Southern Africa Middle Income NAM
17 Malawi Eastern Africa Low Income MWI
18 Guinea-Bissau Western Africa Low Income GNB
19 Lesotho Southern Africa Middle Income LSO
20 Cabo Verde Western Africa Middle Income CPV
21 Senegal Western Africa Low Income SEN
22 Gambia Western Africa Low Income GMB
24 Uganda Eastern Africa Low Income UGA
26 Burundi Eastern Africa Low Income BDI
27 Rwanda Eastern Africa Low Income RWA
28 United Republic of Tanzania Eastern Africa Low Income TZA
29 Ethiopia Eastern Africa Low Income ETH
30 South Africa Southern Africa Middle Income ZAF
32 Kenya Eastern Africa Middle Income KEN
42 Zambia Eastern Africa Middle Income ZMB
57 Eritrea Eastern Africa Low Income ERI
67 South Sudan Eastern Africa Low Income SSD
105 Botswana Southern Africa Middle Income BWA
162 Central African Republic Middle Africa Low Income CAF
163 Democratic Republic of the Congo Middle Africa Low Income COD
166 Nigeria Western Africa Middle Income NGA

Now, we can join our temp_df with the map_df dataframe using the "Country" common column:

In [26]:
map_df = pd.merge(left=map_df, right=temp_df, on="Country", how='left')
map_df
Out[26]:
Country Total Affected UN Sub Region Income group ISO
0 Angola 4922216 Middle Africa Middle Income AGO
1 Botswana 38000 Southern Africa Middle Income BWA
2 Burkina Faso 13250928 Western Africa Low Income BFA
3 Burundi 2412500 Eastern Africa Low Income BDI
4 Cabo Verde 146093 Western Africa Middle Income CPV
5 Cameroon 2401127 Middle Africa Middle Income CMR
6 Central African Republic 2221692 Middle Africa Low Income CAF
7 Chad 8822162 Middle Africa Low Income TCD
8 Democratic Republic of the Congo 25972806 Middle Africa Low Income COD
9 Djibouti 1025176 Eastern Africa Middle Income DJI
10 Eritrea 1700000 Eastern Africa Low Income ERI
11 Eswatini 2104000 Southern Africa Middle Income SWZ
12 Ethiopia 74705679 Eastern Africa Low Income ETH
13 Gambia 491100 Western Africa Low Income GMB
14 Guinea-Bissau 132000 Western Africa Low Income GNB
15 Kenya 29750000 Eastern Africa Middle Income KEN
16 Lesotho 3608515 Southern Africa Middle Income LSO
17 Madagascar 5665290 Eastern Africa Low Income MDG
18 Malawi 19727628 Eastern Africa Low Income MWI
19 Mali 13660753 Western Africa Low Income MLI
20 Mauritania 8205374 Western Africa Middle Income MRT
21 Mozambique 10262271 Eastern Africa Low Income MOZ
22 Namibia 2261000 Southern Africa Middle Income NAM
23 Niger 29303986 Western Africa Low Income NER
24 Nigeria 19110398 Western Africa Middle Income NGA
25 Rwanda 1000000 Eastern Africa Low Income RWA
26 Senegal 2093702 Western Africa Low Income SEN
27 Somalia 27535624 Eastern Africa Low Income SOM
28 South Africa 30450000 Southern Africa Middle Income ZAF
29 South Sudan 15623670 Eastern Africa Low Income SSD
30 Sudan 17839300 Northern Africa Middle Income SDN
31 Uganda 3542000 Eastern Africa Low Income UGA
32 United Republic of Tanzania 8954000 Eastern Africa Low Income TZA
33 Zambia 4210000 Eastern Africa Middle Income ZMB
34 Zimbabwe 21135118 Eastern Africa Low Income ZWE
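
When joining an aggregated table to a lookup table, it can help to let pandas confirm that each country matches exactly one lookup row. The sketch below is one way to do that on three rows copied from the tables above; the names `totals`, `lookup` and `merged` are hypothetical, and `validate` and `indicator` are optional arguments of `pd.merge` that the notebook itself does not use.

```python
import pandas as pd

# Three rows copied from the aggregated table above
totals = pd.DataFrame(
    {
        "Country": ["Angola", "Botswana", "Ethiopia"],
        "Total Affected": [4922216, 38000, 74705679],
    }
)

# Matching rows from the de-duplicated lookup table
lookup = pd.DataFrame(
    {
        "Country": ["Ethiopia", "Angola", "Botswana"],
        "ISO": ["ETH", "AGO", "BWA"],
    }
)

merged = pd.merge(
    left=totals,
    right=lookup,
    on="Country",
    how="left",
    validate="one_to_one",  # raises MergeError if a country appears twice on either side
    indicator=True,         # adds a "_merge" column describing where each row came from
)

# Rows that found no partner in the lookup table would show up here
unmatched = merged[merged["_merge"] != "both"]
print(merged)
print("Unmatched rows:", len(unmatched))
```

A left join keeps every country from the aggregated table, so the row count should stay the same before and after the merge.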

Now we have an aggregated table containing the other values that we can use for our map. Next, let's group the "Total Affected" column into intervals reflecting the severity of occurrence:

In [27]:
map_df.describe()
Out[27]:
Total Affected
count 3.500000e+01
mean 1.183669e+07
std 1.472545e+07
min 3.800000e+04
25% 2.162846e+06
50% 5.665290e+06
75% 1.847485e+07
max 7.470568e+07
In [28]:
# The box plot below shows us the distribution of the data. 
# This will help us identify the outliers and adequately group the data into intervals
px.box(data_frame=map_df, x='Total Affected')
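
A common rule of thumb flags values above Q3 + 1.5 × IQR as outliers. The sketch below applies that rule to the quartiles printed by `describe()` above. The fence drawn by the box plot may differ slightly, because plotting libraries can compute quartiles with a different method, so treat this as an approximation.

```python
# Quartiles as printed by describe() above
q1 = 2.162846e06
q3 = 1.847485e07

iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

# The five largest country totals from the table above
largest = {
    "Ethiopia": 74705679,
    "South Africa": 30450000,
    "Kenya": 29750000,
    "Niger": 29303986,
    "Somalia": 27535624,
}

outliers = [country for country, affected in largest.items() if affected > upper_fence]
print(f"Upper fence: {upper_fence:,.0f}")
print("Above the fence:", outliers)
```

Under this rule only Ethiopia stands out, which is one reason to give the highest interval its own "critical" label rather than stretching the other bins.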

Split the "Total Affected" column into intervals

In [31]:
bin_edges = [0, 5000000, 15000000, 30000000, map_df["Total Affected"].max()]
bin_labels = ["Low Severity (0-5M)", "Moderate Severity (>5M - 15M)", "High Severity (>15M - 30M)", "Critical Severity (>30M)"]
map_df["Severity Level"] = pd.cut(x=map_df["Total Affected"], bins=bin_edges, labels=bin_labels, right=True)
map_df
Out[31]:
Country Total Affected UN Sub Region Income group ISO Severity Level
0 Angola 4922216 Middle Africa Middle Income AGO Low Severity (0-5M)
1 Botswana 38000 Southern Africa Middle Income BWA Low Severity (0-5M)
2 Burkina Faso 13250928 Western Africa Low Income BFA Moderate Severity (>5M - 15M)
3 Burundi 2412500 Eastern Africa Low Income BDI Low Severity (0-5M)
4 Cabo Verde 146093 Western Africa Middle Income CPV Low Severity (0-5M)
5 Cameroon 2401127 Middle Africa Middle Income CMR Low Severity (0-5M)
6 Central African Republic 2221692 Middle Africa Low Income CAF Low Severity (0-5M)
7 Chad 8822162 Middle Africa Low Income TCD Moderate Severity (>5M - 15M)
8 Democratic Republic of the Congo 25972806 Middle Africa Low Income COD High Severity (>15M - 30M)
9 Djibouti 1025176 Eastern Africa Middle Income DJI Low Severity (0-5M)
10 Eritrea 1700000 Eastern Africa Low Income ERI Low Severity (0-5M)
11 Eswatini 2104000 Southern Africa Middle Income SWZ Low Severity (0-5M)
12 Ethiopia 74705679 Eastern Africa Low Income ETH Critical Severity (>30M)
13 Gambia 491100 Western Africa Low Income GMB Low Severity (0-5M)
14 Guinea-Bissau 132000 Western Africa Low Income GNB Low Severity (0-5M)
15 Kenya 29750000 Eastern Africa Middle Income KEN High Severity (>15M - 30M)
16 Lesotho 3608515 Southern Africa Middle Income LSO Low Severity (0-5M)
17 Madagascar 5665290 Eastern Africa Low Income MDG Moderate Severity (>5M - 15M)
18 Malawi 19727628 Eastern Africa Low Income MWI High Severity (>15M - 30M)
19 Mali 13660753 Western Africa Low Income MLI Moderate Severity (>5M - 15M)
20 Mauritania 8205374 Western Africa Middle Income MRT Moderate Severity (>5M - 15M)
21 Mozambique 10262271 Eastern Africa Low Income MOZ Moderate Severity (>5M - 15M)
22 Namibia 2261000 Southern Africa Middle Income NAM Low Severity (0-5M)
23 Niger 29303986 Western Africa Low Income NER High Severity (>15M - 30M)
24 Nigeria 19110398 Western Africa Middle Income NGA High Severity (>15M - 30M)
25 Rwanda 1000000 Eastern Africa Low Income RWA Low Severity (0-5M)
26 Senegal 2093702 Western Africa Low Income SEN Low Severity (0-5M)
27 Somalia 27535624 Eastern Africa Low Income SOM High Severity (>15M - 30M)
28 South Africa 30450000 Southern Africa Middle Income ZAF Critical Severity (>30M)
29 South Sudan 15623670 Eastern Africa Low Income SSD High Severity (>15M - 30M)
30 Sudan 17839300 Northern Africa Middle Income SDN High Severity (>15M - 30M)
31 Uganda 3542000 Eastern Africa Low Income UGA Low Severity (0-5M)
32 United Republic of Tanzania 8954000 Eastern Africa Low Income TZA Moderate Severity (>5M - 15M)
33 Zambia 4210000 Eastern Africa Middle Income ZMB Low Severity (0-5M)
34 Zimbabwe 21135118 Eastern Africa Low Income ZWE High Severity (>15M - 30M)
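
With `right=True`, each interval includes its right edge and excludes its left edge, so a value of exactly 0 would fall outside the first bin. The sketch below shows this behaviour on four hand-picked values; the shortened labels and the names `edges` and `values` are illustrative. No country in this dataset has a total of 0, so the table above is not affected.

```python
import pandas as pd

edges = [0, 5_000_000, 15_000_000, 30_000_000, 74_705_679]
labels = ["Low", "Moderate", "High", "Critical"]

# Hand-picked values that sit on or just past the bin edges
values = pd.Series([0, 5_000_000, 5_000_001, 30_450_000])

# Intervals are (0, 5M], (5M, 15M], ... so 0 itself is left unassigned
default_bins = pd.cut(values, bins=edges, labels=labels, right=True)

# include_lowest=True closes the first interval on the left as well
with_lowest = pd.cut(values, bins=edges, labels=labels, right=True, include_lowest=True)

print(pd.DataFrame({"value": values, "right=True": default_bins, "include_lowest=True": with_lowest}))
```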

Finally, we plot the map with the actual data.

In [36]:
color_discrete_map = {
    "Critical Severity (>30M)": "#A70100",
    "High Severity (>15M - 30M)": "#D93F00",
    "Moderate Severity (>5M - 15M)": "#FD8E2A",
    "Low Severity (0-5M)": "#FFD983"
}

map_plot = px.choropleth(data_frame=map_df, locations="ISO", locationmode="ISO-3", scope='africa',
                         color='Severity Level', color_discrete_map=color_discrete_map,
                         # hover_data takes a list of column names from data_frame
                         hover_data=["Severity Level", "Country", "Total Affected", "Income group", "UN Sub Region"],
                         height=600, width=1000)

# update layout
map_plot.update_layout(title="Drought Severity Level by Country",
                      margin={"r":0, "t":40, "l":0, "b":0},
    legend=dict(x=0, y=0))
map_plot.show()

Distribution of persons affected by drought by income group of the country

In [12]:
# Sum only the "Total Affected" column within each income group
count_by_income = df.groupby("Income group")["Total Affected"].sum().reset_index()
count_by_income
Out[12]:
Income group Total Affected
0 Low Income 288212909
1 Middle Income 126071199

Let's compute the percentage for each Income group:

In [13]:
count_by_income['Percentage'] = (count_by_income['Total Affected'] / sum(count_by_income['Total Affected'])) * 100
count_by_income
Out[13]:
Income group Total Affected Percentage
0 Low Income 288212909 69.568903
1 Middle Income 126071199 30.431097

We can also visualize this:

In [15]:
px.bar(data_frame = count_by_income, x = "Income group", y = "Total Affected", text="Percentage")

What did you observe?

OBSERVATION: The bar chart shows that more than twice as many persons were affected in low income countries as in middle income countries. This suggests that a country's income level is related to the number of persons affected by drought.
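
The "more than twice" comparison follows directly from the two totals in the income group table above, as this short calculation shows. The variable names are illustrative.

```python
# Totals copied from the income group table above
low_income = 288212909
middle_income = 126071199

ratio = low_income / middle_income
print(f"Low income total is {ratio:.2f} times the middle income total")
```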

Research Question 2

In [ ]:
# Continue to explore the data to address your additional research
#   questions. Add more headers as needed if you have more questions to
#   investigate.

Conclusions